Evaluation Metrics for Generation

نویسندگان

Srinivas Bangalore

Owen Rambow

Steve Whittaker

چکیده

Certain generation applications may profit from the use of stochastic methods. In developing stochastic methods, it is crucial to be able to quickly assess the relative merits of different approaches or models. In this paper, we present several types of intrinsic (system internal) metrics which we have used for baseline quantitative assessment. This quantitative assessment should then be augmented to a fuller evaluation that examines qualitative aspects. To this end, we describe an experiment that tests correlation between the quantitative metrics and human qualitative judgment. The experiment confirms that intrinsic metrics cannot replace human evaluation , but some correlate significantly with human judgments of quality and understandability and can be used for evaluation during development. 1 Introduction For many applications in natural language generation (NLG), the range of linguistic expressions that must be generated is quite restricted, and a grammar for a surface realization component can be fully specified by hand. Moreover, iLL inany cases it is very important not to deviate from very specific output in generation (e.g., maritime weather reports), in which case hand-crafted grammars give excellent control. In these cases, evaluations of the generator that rely on human judgments (Lester and Porter, I997) or on human annotation of the test corpora (Kukich, 1983) are quite sufficient .... However. in other NLG applications the variety of the output is much larger, and the demands on the quality of the output are solnewhat less stringent. A typical example is NLG in the context of (interlingua-or transfer-based) inachine translation. Another reason for relaxing the quality of the output may be that not enough time is available to develop a full gramnlar for a new target, language in NLG. ILL all these cases, stochastic methods provide an alternative to hand-crafted approaches to NLG. 1 To our knowledge, the first to use stochastic techniques in an NLG realization module were Langkilde and Knight (1998a) and (~998b) (see also (Langk-ilde, 2000)). As is the case for stochastic approaches in natural language understanding, the research and development itself requires an effective intrinsic metric in order to be able to evaluate progress. In this paper, we discuss several evaluation metrics that we are using during the development of FERGUS (Flexible Empiricist/Rationalist Generation Using Syntax). FERCUS, a realization module, follows Knight and Langkilde's seminal work in using an n-gram language model, but we augment it with a tree-based stochastic model and a lexicalized syntactic grammar. The …

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using the Taxonomy and the Metrics: What to Study When and Why; Comment on “Metrics and Evaluation Tools for Patient Engagement in Healthcare Organization- and System-Level Decision-Making: A Systematic Review”

Dukhanin and colleagues’ taxonomy of metrics for patient engagement at the organizational and system levels has great potential for supporting more careful and useful evaluations of this ever-growing phenomenon. This commentary highlights the central importance to the taxonomy of metrics assessing the extent of meaningful participation in decision-making by patients, consumers and community mem...

متن کامل

Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines

Purpose: Traditionally, there have many metrics for evaluating the search engine, nevertheless various researchers’ proposed new metrics in recent years. Aware of this new metrics is essential to conduct research on evaluation of the search engine field. So, the purpose of this study was to provide an analysis of important and new metrics for evaluating the search engines. Methodology: This is ...

متن کامل

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Machine Translation Evaluation Metrics (MTEMs) are the central core of Machine Translation (MT) engines as they are developed based on frequent evaluation. Although MTEMs are widespread today, their validity and quality for many languages is still under question. The aim of this research study was to examine the validity and assess the quality of MTEMs from Lexical Similarity set on machine tra...

متن کامل

Metrics and Evaluation Tools for Patient Engagement in Healthcare Organization- and System-Level Decision-Making: A Systematic Review

Background Patient, public, consumer, and community (P2C2) engagement in organization-, community-, and systemlevel healthcare decision-making is increasing globally, but its formal evaluation remains challenging. To define a taxonomy of possible P2C2 engagement metrics and compare existing evaluation tools against this taxonomy, we conducted a systematic review. Methods A broad search strate...

متن کامل

Evaluating Evaluation Methods for Generation in the Presence of Variation

Recent years have seen increasing interest in automatic metrics for the evaluation of generation systems. When a system can generate syntactic variation, automatic evaluation becomes more difficult. In this paper, we compare the performance of several automatic evaluation metrics using a corpus of automatically generated paraphrases. We show that these evaluation metrics can at least partially ...

متن کامل

A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset

Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of the electrocardiogram signal classification and segmentation methodologies indicates that algorithms which are able to effectively handle the nonstationary and uncertainty of the signals should be used for ECG analysis. Ex...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2000

Evaluation Metrics for Generation

نویسندگان

چکیده

منابع مشابه

Using the Taxonomy and the Metrics: What to Study When and Why; Comment on “Metrics and Evaluation Tools for Patient Engagement in Healthcare Organization- and System-Level Decision-Making: A Systematic Review”

Review of ranked-based and unranked-based metrics for determining the effectiveness of search engines

The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Metrics and Evaluation Tools for Patient Engagement in Healthcare Organization- and System-Level Decision-Making: A Systematic Review

Evaluating Evaluation Methods for Generation in the Presence of Variation

A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset

عنوان ژورنال:

اشتراک گذاری